Conversation
divyashreepathihalli
left a comment
This is great @VarunS1997!! Left a few comments. Also, please add a Colab to verify the output.
One additional piece of overhead work is needed: please add keras_cv/models/feature_extractor/coca to this file. PS: we will fix this overhead soon, but in the meantime this is what we need to do to make sure the large GPU tests run.
…each layer with expected sizing
mattdangerw
left a comment
Thanks! Left a few comments.
import numpy as np
from keras import Sequential
from keras_cv.api_export import keras_cv_export
from keras_nlp.layers import RotaryEmbedding, TransformerDecoder
I think we were doing a conditional import of keras_nlp, so keras_cv stayed installable without keras-nlp if you are only using unrelated features. But that was when the only use was for one tokenizer.
keras-cv/keras_cv/models/feature_extractor/clip/clip_model.py
Lines 81 to 85 in 5faae37
We could reconsider if keras-cv should hard depend on keras-nlp if we want more stuff like this? No strong feelings. @divyashreepathihalli fyi
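For reference, a minimal sketch of the conditional-import pattern described above, assuming a lazy check at usage time; the helper name and error message are illustrative, not the actual keras_cv code:

```python
# Sketch only: guard the import so keras_cv stays installable without
# keras-nlp, and fail lazily when a keras-nlp-backed feature is used.
try:
    import keras_nlp
except ImportError:
    keras_nlp = None


def assert_keras_nlp_installed():  # hypothetical helper name
    if keras_nlp is None:
        raise ImportError(
            "This feature requires the `keras-nlp` package. "
            "Please install it with `pip install keras-nlp`."
        )
```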
I think as we support more multi modal models we should depend on KerasNLP. If the tf-text install issue is resolved, we should add it.
Sgtm! Though if we switch to a hard dependency here, we should probably add keras-nlp as a dependency in setup.py (which comes with a transitive dependency on tensorflow-text and tensorflow just fyi).
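If the hard dependency route were taken, the setup.py change would presumably look something like the sketch below; the surrounding fields and omitted requirements are assumptions, not the actual keras-cv setup.py:

```python
# Illustrative setup.py excerpt only; unrelated requirements omitted.
from setuptools import find_packages, setup

setup(
    name="keras-cv",
    packages=find_packages(),
    install_requires=[
        # New hard dependency under discussion; note it transitively
        # pulls in tensorflow-text and tensorflow.
        "keras-nlp",
    ],
)
```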
Do we want to include that in this PR? There are already some imports of Keras-NLP in other places in Keras CV.
If we make it a separate PR, it'll make it easier to rollback if we need to. Considering it's a new dependency, might be worth separating.
# [batch_size, sequence_length+1, text_dim]
text_tokens = np.concatenate(texts, self.cls_token)
mask = np.concatenate((np.ones_like(texts), np.zeros_like(self.cls_token)))
I can't tell if this mask is the right shape or not. Usually you want something like (batch_size, seq_length, seq_length) (or seq_length + 1, if that is the effective sequence length). What is text_dim here?
text_dim is the dimensionality of the text embeddings
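For concreteness, a small shape sketch of the two mask layouts being discussed; the values are illustrative, and it mirrors the ones/zeros choice from the snippet above rather than prescribing it:

```python
import numpy as np

batch_size, seq_len, text_dim = 2, 8, 64
texts = np.zeros((batch_size, seq_len, text_dim))
cls_token = np.zeros((batch_size, 1, text_dim))

# Concatenate along the sequence axis: (batch_size, seq_len + 1, text_dim)
text_tokens = np.concatenate((texts, cls_token), axis=1)

# A per-token padding mask is (batch_size, seq_len + 1) ...
token_mask = np.concatenate(
    (np.ones((batch_size, seq_len)), np.zeros((batch_size, 1))), axis=1
)

# ... while an attention mask is usually (batch_size, seq_len + 1, seq_len + 1).
attn_mask = token_mask[:, None, :] * token_mask[:, :, None]
print(text_tokens.shape, token_mask.shape, attn_mask.shape)
# (2, 9, 64) (2, 9) (2, 9, 9)
```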
divyashreepathihalli
left a comment
Thanks! I left a few comments. Also, please do add a Colab demo to show that the model is working as expected.
keras_cv/layers/attention_pooling.py
Outdated
super().build(input_shape)

self.multi_head_attn.build(input_shape)
self.layer_norm.build(input_shape)
is the input shape for layer_norm correct?
Fixed it, let me know if it still doesn't look right!
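As a hedged illustration of the concern, here is a minimal attention-pooling sketch assuming a single learned query token: the LayerNormalization is built on the shape of the attention output (query length 1) rather than on the raw key/value input_shape. Class and variable names are illustrative, not the PR's code.

```python
import keras
from keras import layers


class AttentionPoolingSketch(layers.Layer):
    """Illustrative only: pools (batch, seq_len, width) down to one learned query token."""

    def __init__(self, num_heads=8, **kwargs):
        super().__init__(**kwargs)
        self.num_heads = num_heads

    def build(self, input_shape):
        width = input_shape[-1]
        self.query = self.add_weight(
            name="query", shape=(1, 1, width), initializer="random_normal"
        )
        self.multi_head_attn = layers.MultiHeadAttention(
            num_heads=self.num_heads, key_dim=width
        )
        self.layer_norm = layers.LayerNormalization()
        # The norm sees the attention output, whose sequence length is the
        # number of query tokens (1 here), not the input sequence length.
        self.layer_norm.build((input_shape[0], 1, width))
        super().build(input_shape)

    def call(self, x):
        # Broadcast the learned query token across the batch dimension.
        query = keras.ops.ones_like(x[:, :1, :]) * self.query
        pooled = self.multi_head_attn(query=query, value=x, key=x)
        return self.layer_norm(pooled)
```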
self.image_patching = PatchingAndEmbedding(
    self.encoder_width, self.img_patch_size
)
self.image_encoder = Sequential(
Sequential might not work; the model will not build properly.
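One possible workaround, as a hedged sketch: keep the blocks in a plain Python list and build/call them explicitly instead of wrapping them in Sequential, so the parent layer controls the shapes. The Dense layers here are just stand-ins for the encoder blocks:

```python
from keras import layers


class ExplicitStack(layers.Layer):
    """Sketch: sub-layers held in a list, built and called in a loop."""

    def __init__(self, width=64, depth=2, **kwargs):
        super().__init__(**kwargs)
        self.blocks = [layers.Dense(width, activation="relu") for _ in range(depth)]

    def build(self, input_shape):
        shape = input_shape
        for block in self.blocks:
            block.build(shape)
            shape = block.compute_output_shape(shape)
        super().build(input_shape)

    def call(self, x):
        for block in self.blocks:
            x = block(x)
        return x
```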
)

self.text_embedding = RotaryEmbedding()
self.unimodal_text_decoder = Sequential(
Again, Sequential might not work well. Please double-check.
    for _ in range(self.unimodal_decoder_depth)
    ]
)
self.multimodal_text_decoder = Sequential(
same comment about Sequential
num_patches = (images_shape[1] // self.img_patch_size) * (
    images_shape[2] // self.img_patch_size
) + 1
self.image_encoder.build((batch_size, self.encoder_width, num_patches))
You could keep batch_size as None, for example:
self.image_encoder.build((None, self.encoder_width, num_patches))
Just for my understanding, is there a specific reason to do that over setting the batch_size?
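One reason often given, stated here as an assumption about the intent: weights created in build() do not depend on the batch dimension, so building with None keeps the layer usable for any batch size at call time. A tiny standalone illustration (Dense is a stand-in for the encoder):

```python
import numpy as np
from keras import layers

dense = layers.Dense(32)
dense.build((None, 16, 128))  # batch dimension left as None

# The built layer now accepts any batch size.
print(dense(np.zeros((2, 16, 128))).shape)  # (2, 16, 32)
print(dense(np.zeros((7, 16, 128))).shape)  # (7, 16, 32)
```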
contrastive loss from the ViT and the uni-modal Text Decoder is combined with a captioning loss from the multi-modal
Decoder in order to produce the combined total loss.

Basic Usage:
This has to be changed to Example:, since we follow only Example: or Examples: as the standard format.
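A minimal illustration of the docstring heading convention being requested; the class body and usage lines are placeholders, not the PR's actual API:

```python
class CoCa:
    """Contrastive Captioner (CoCa) feature extractor.

    Example:

    model = CoCa()          # placeholder usage, not the real signature
    outputs = model(inputs)
    """
```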
@VarunS1997 will you be completing this one?
What does this PR do?
Implements the work done in "CoCa: Contrastive Captioners are Image-Text Foundation Models" (https://arxiv.org/pdf/2205.01917.pdf).
This PR requires: